Abstract

Navigational queries for graph-structured data, such as the regular 
path queries and the context-free path queries, are usually evaluated to a 
relation of node-pairs (m, n) such that there is a path from m to n satisfying
the conditions of the query. Although this relational query semantics
has practical value, we believe that it can only
provide limited insight into the structure of the graph data. To address the
limits of the relational query semantics, we introduce the all-path query 
semantics and the single-path query semantics. Under these path-based 
query semantics, a query is evaluated to all paths satisfying the conditions 
of the query, or, respectively, to a single such path. 

While focusing on context-free path queries, we provide a formal framework
for evaluating queries on graphs using both path-based query semantics.
For the all-path query semantics, we show that the result of a query
can be represented by a finite context-free grammar annotated with
node-information relevant for deriving each path in the query result. For the
single-path query semantics, we propose to search for a path of minimum 
length. We reduce the problem of finding such a path of minimum length 
to finding a string of minimum length in a context-free language, and for 
deriving such a string we propose a novel algorithm. 

Our initial results show that the path-based query semantics have 
added practical value and that query evaluation for both path-based query 
semantics is feasible, even when query results grow very large. For the 
single-path query semantics, determining strict worst-case upper bounds 
on the size of the query result remains the focus of future work. 


1 Introduction 

The graph data model is one of the most versatile and natural data models in 
use: graph-structured data is everywhere and examples can be found in family 
trees, social networks, process models, gene networks, XML data, and RDF 
data. For querying graphs, many different query languages have
been developed, proposed, and researched. At their
core, most graph query languages depend on navigating the graph. This graph 
navigation is usually performed by means of a regular expression that describes 
the allowed edge-labeling of the paths that should be traversed in the graph. 
As the regular expressions have limited expressive power, we focus on a more
expressive navigational query language, namely the context-free path queries
that use context-free grammars to describe the labeling of paths.

These navigational queries expressed by context-free path queries are usually 
evaluated to a relation of node-pairs (m, n) such that there is a path from 
m to n whose labeling is described by a context-free grammar—the relational 
query semantics, or to the truth value true whenever such a path exists—the 
boolean query semantics. Although many practical problems can be answered 
by navigational queries evaluated under the usual semantics, we believe that 
the relational query semantics and the boolean query semantics are limiting. 
The inability to view the paths of interest hampers the understanding of the 
data, makes query debugging harder, and makes it impossible to answer certain 
practical problems. 

To address the limitations of the traditional query semantics, we introduce 
path-based query semantics. Concretely, we introduce the all-path query semantics
and the single-path query semantics. Under the all-path query semantics, a
query is evaluated to all paths satisfying the conditions of the query, and under 
the single-path query semantics one such path is chosen. The practical usage of 
these path-based query semantics can be illustrated by a simple example: 

Example 1. Consider a collection of family trees represented by a graph in which
the nodes represent people and the edges represent parentOf and childOf relations
(between parents and their children). Consider the context-free grammar
with the following production rules:

$q \mapsto \mathit{parentOf}\; q\; \mathit{childOf}$, $\quad q \mapsto \mathit{parentOf}\; \mathit{childOf}$.

Using the standard relational query semantics, the query q evaluates to
the relation of node-pairs (m, n) such that m and n are both k-th generation
descendants of a common ancestor. Using the single-path query semantics that
we propose, the query q evaluates to a path from m to a common ancestor and
from this common ancestor to n, showing why m and n are both k-th generation
descendants of a common ancestor, while, at the same time, showing who this
common ancestor is.

Observe that the language of the context-free grammar used in Example 1 is
well-known to not be expressible by a regular expression [23]. Still, this simple
example is at the basis of practical queries that are used in, for example,
bioinformatics.

For graph querying, path-based query semantics have only gained limited 
attention. For the regular expressions, Barceló et al. [5] introduced the extended
regular path queries that have path variables for output. The main focus of
Barceló et al. is, however, on the use of path variables for expressivity purposes,
and path-based results are only studied in limited detail. Recent work by
Hofman et al. provides an alternative to the use of path-based query semantics
for debugging: to gain more insight into the behavior of regular path queries with
respect to the expected behavior, Hofman et al. propose a technique based on 
separability. Although this approach addresses query debugging, it does not lift 
the other limitations of the relational and the boolean query semantics. 

In the setting of model checking using CTL [5] , path-based query semantics 
are widely used. Normally, CTL formulae are evaluated to true or false,
indicating whether or not the graph meets certain conditions. An important ability
of CTL model checking algorithms is to not only answer CTL formulae with a
truth value, but to also answer with a witness or a counterexample for this
truth value. These witnesses and counterexamples are represented by a path
in the graph that shows why the graph does or does not meet the conditions
expressed by the CTL formulae. Counterexamples and witnesses also exist for
other modal logics, such as LTL. These path-based witnesses and counterexamples
are especially useful in the analysis of the model checking results.

In this work we study path-based query semantics. We provide a formal 
framework for evaluating queries on graphs using the all-path query semantics 
and the single-path query semantics. To achieve this, we first show how to 
represent the query result under the all-path query semantics by a context-free 
grammar annotated with node-information, and we show that this context-free 
grammar can be used to derive exactly those paths that are in the query result. 

For the single-path query semantics, we propose to search for a path of 
minimum length. As we can represent the set of all paths by an annotated 
context-free grammar, we reduce the problem of finding a path of minimum 
length matching the query conditions to finding a string of minimum length in a 
context-free language. For deriving such a string of minimum length, we propose 
a novel algorithm. We then proceed with the analysis of this minimum-length
string derivation algorithm applied to annotated context-free grammars by
analyzing the possible length of minimum-length paths. For annotated context-free
grammars over the singleton alphabet, we show a close-to-strict worst-case
upper bound on the length of minimum-length paths that is linear in the number
of nodes in the graph. For general annotated context-free grammars we show 
that the worst-case upper bound on the length of minimum-length paths is at 
least quadratic in the number of nodes in the graph. 

To test the behavior of the minimum-length path derivation algorithm in 
practice, we performed measurements on an initial implementation. These results
show promise, as the initial implementation shows acceptable performance
for a range of context-free path queries. 

Organization. In Section 2, we present the basic notions used throughout this
paper. In Section 3, we present the context-free path queries together with their
usual semantics, and we introduce the all-path and single-path query semantics.
In Section 4 and Section 5, we introduce approaches to evaluate queries using the
all-path query semantics and the single-path query semantics, respectively. In
Section 6, we present our results on a small-scale implementation. In Section 7,
we summarize our findings and propose directions for future work.


2 Preliminaries 

We call a sequence $\sigma_1 \ldots \sigma_n$ of symbols a string. The length of
string $S = \sigma_1 \ldots \sigma_n$, denoted by $\|S\|$, is $n$. The empty string
is denoted by $\epsilon$ and we usually treat individual symbols as strings of
length one. The concatenation of two strings $S_1$ and $S_2$ is denoted by
$S_1 \cdot S_2$. If $\Sigma$ is a set of symbols, then we denote the set of all
strings made of symbols from $\Sigma$ by $\Sigma^*$.

Definition 1. A graph is a triple $G = (Q, \Sigma, \delta)$ with
$Q \cap \Sigma = \emptyset$, in which $Q$ is a finite set of nodes, $\Sigma$ is a
finite set of alphabet symbols used as edge labels, and
$\delta \subseteq Q \times \Sigma \times Q$ is a finite set of labeled edges.
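If $m \in Q$ and $\sigma \in \Sigma$, then
$\delta(m, \sigma) = \{n \mid (m, \sigma, n) \in \delta\}$ denotes those nodes
that have an incoming edge labeled with $\sigma$ originating at $m$. If
$m \in Q$ and $S \in \Sigma^*$, then

$$\delta^*(m, S) = \begin{cases} \{m\} & \text{if } S = \epsilon; \\ \bigcup_{o \in \delta^*(m, S')} \delta(o, \sigma) & \text{if } S = S' \cdot \sigma. \end{cases}$$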

The language of graph $G$ with respect to $m, n \in Q$, denoted by
$\mathcal{L}(G; m, n)$, is defined by

$$\mathcal{L}(G; m, n) = \{S \mid S \in \Sigma^* \land n \in \delta^*(m, S)\}.$$


Let $G = (Q, \Sigma, \delta)$ be a graph. A path
$\pi = n_1 \sigma_1 \ldots n_{i-1} \sigma_{i-1} n_i$ in $G$ is a sequence with,
for all $1 \le j < i$, $(n_j, \sigma_j, n_{j+1}) \in \delta$. We write
$n_1 \pi n_i$ to indicate that $\pi$ starts at node $n_1$ and ends at node $n_i$.
The trace of $\pi$ is defined by
$\mathrm{trace}(\pi) = \sigma_1 \ldots \sigma_{i-1}$. Observe that traces are
strings over the alphabet $\Sigma$.
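
The notions above translate almost directly into code. The following is a minimal Python sketch, assuming a graph is stored as a set of (source, label, target) triples. The names `delta`, `delta_star`, `in_language`, `trace`, `is_path`, and the sample edge set `EDGES` are illustrative choices and do not come from this paper.

```python
from typing import Sequence, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, edge label, target node)

# Hypothetical sample graph, used only for illustration.
EDGES: Set[Edge] = {
    ("A", "x", "B"),
    ("A", "x", "C"),
    ("B", "y", "D"),
    ("C", "y", "D"),
    ("C", "x", "A"),
}


def delta(edges: Set[Edge], m: str, sigma: str) -> Set[str]:
    """Nodes with an incoming edge labeled sigma that originates at m."""
    return {n for (src, label, n) in edges if src == m and label == sigma}


def delta_star(edges: Set[Edge], m: str, s: Sequence[str]) -> Set[str]:
    """Nodes reachable from m by following the labels of s in order."""
    current = {m}  # the empty string reaches only m itself
    for sigma in s:
        current = {n for o in current for n in delta(edges, o, sigma)}
    return current


def in_language(edges: Set[Edge], m: str, n: str, s: Sequence[str]) -> bool:
    """True if the string s is in the language of the graph w.r.t. m and n."""
    return n in delta_star(edges, m, s)


def trace(path: Sequence[str]) -> Tuple[str, ...]:
    """Labels of a path given as the alternating sequence n1, s1, n2, ..., ni."""
    return tuple(path[1::2])


def is_path(edges: Set[Edge], path: Sequence[str]) -> bool:
    """True if every consecutive (node, label, node) triple is an edge."""
    nodes, labels = path[0::2], path[1::2]
    return len(nodes) == len(labels) + 1 and all(
        (nodes[j], labels[j], nodes[j + 1]) in edges for j in range(len(labels))
    )
```

In this sketch, a string belongs to the language of the graph with respect to m and n exactly when it is the trace of some path from m to n, which is why `in_language` only needs `delta_star`.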

Definition 2. A context-free grammar is a triple $C = (N, \Sigma, P)$ with
$N \cap \Sigma = \emptyset$, in which $N$ is a set of non-terminals, $\Sigma$ is a
finite set of alphabet symbols, and $P$ is a set of production rules.¹ In the
above, a production rule is of the form $a \mapsto b\, c$ or $a \mapsto \sigma$,
in which $a, b, c \in N$ and $\sigma \in \Sigma$.²

Production rules are to be interpreted as rewrite rules: if
$S = S_1 \cdot a \cdot S_2$ is a string with $S_1, S_2 \in (N \cup \Sigma)^*$ and
$a \in N$, and if $(a \mapsto S') \in P$, then $S$ can be rewritten into
$S_1 \cdot S' \cdot S_2$ by application of $a \mapsto S'$. We write
$S \to_P^* S'$ if $S$ can be rewritten into $S'$ by a finite number of rewrites
using production rules in $P$, and we write $S \to_P^+ S'$ if $S \to_P^* S'$ and
at least one rewrite step is necessary to rewrite $S$ into $S'$. The language of
a context-free grammar $C$ with respect to $a \in N$, denoted by
$\mathcal{L}(C; a)$, is defined by

$$\mathcal{L}(C; a) = \{S \mid S \in \Sigma^* \land a \to_P^* S\}.$$


3 Context-free path queries 

Let $C = (N, \Sigma, P)$ be a context-free grammar with $a \in N$. We say that
$a$ is a context-free path query. Usually, these queries are evaluated using the
boolean query semantics or the relational query semantics:³

1. Using boolean query semantics, the query $a$ on graph $G$ evaluates to the
truth value of $\exists m \exists n\; \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$.

2. Using relational query semantics, the query $a$ on graph $G$ evaluates to the
binary relation $\{(m, n) \mid \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset\}$.

¹Usually, context-free grammars are defined with a dedicated start non-terminal. It is
straightforward to specialize our results on context-free grammars to the setting with a
dedicated start non-terminal.

²To simplify the presentation, we assume that context-free grammars are in Chomsky Normal
Form, and we exclude the derivation of $\epsilon$. Unless stated otherwise, it is straightforward
to generalize our results on context-free grammars to the setting that includes production rules
of the form $a \mapsto \epsilon$.

³Commonly, relational query semantics is referred to as path query semantics [14]. To
avoid confusion with our path-based query semantics, we have chosen a different name
in this paper.
We study two alternative ways of evaluating queries on graphs: the all-path
query semantics and the single-path query semantics:

3. Using all-path query semantics, the query $a$ on graph $G$, with respect to
nodes $m, n \in Q$, evaluates to the set of all paths $m \pi n$ in $G$ with
$\mathrm{trace}(\pi) \in \mathcal{L}(C; a)$.

4. Using single-path query semantics, the query $a$ on graph $G$, with respect
to nodes $m, n \in Q$, evaluates to a single path $m \pi n$ in $G$ with
$\mathrm{trace}(\pi) \in \mathcal{L}(C; a)$ (if such a path exists).

The following example illustrates the usages of these query semantics. 

Example 2. Let G be a collection of family trees in which the nodes represent 
people and the edges represent familyOf relations (between parents and their 
children). We have the context-free grammar with the following production 
rules: 


$q \mapsto \mathit{familyOf}$, $\quad q \mapsto q\, q$.

Depending on the semantics used, the query q evaluated on G answers various 
questions: 

1. Using boolean query semantics: 'are there family members in these family
trees?'

2. Using relational query semantics: 'provide all pairs of people that are
related.'

3. Using all-path query semantics: 'provide every way in which m and n are
related.'

4. Using single-path query semantics: 'provide a proof that m and n are
related,' or 'show how m and n are related.'
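
The boolean and relational semantics of Example 2 can be sketched with a naive fixpoint computation. The Python code below is only an illustration under stated assumptions: it is not the recognizer used later in this paper, the grammar is assumed to be in Chomsky Normal Form, and the sample graph `FAMILY` with its node names is hypothetical.

```python
def derivable_triples(terminal_rules, binary_rules, edges):
    """All (a, m, n) such that some path from m to n has a trace derivable from a.

    terminal_rules: set of (a, sigma) for rules a -> sigma
    binary_rules:   set of (a, b, c) for rules a -> b c
    edges:          set of (m, sigma, n) labeled edges
    """
    facts = {
        (a, m, n)
        for (a, sigma) in terminal_rules
        for (m, label, n) in edges
        if label == sigma
    }
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for (a, b, c) in binary_rules:
            derived = {
                (a, m, n)
                for (b2, m, o) in facts if b2 == b
                for (c2, o2, n) in facts if c2 == c and o2 == o
            }
            if not derived <= facts:
                facts |= derived
                changed = True
    return facts


def relational_semantics(query, terminal_rules, binary_rules, edges):
    """Node-pairs (m, n) connected by a path matching the query."""
    facts = derivable_triples(terminal_rules, binary_rules, edges)
    return {(m, n) for (a, m, n) in facts if a == query}


def boolean_semantics(query, terminal_rules, binary_rules, edges):
    """True if at least one matching path exists."""
    return bool(relational_semantics(query, terminal_rules, binary_rules, edges))


# Hypothetical family graph and the grammar of Example 2.
FAMILY = {("ann", "familyOf", "bob"), ("bob", "familyOf", "cat")}
TERMINAL_RULES = {("q", "familyOf")}
BINARY_RULES = {("q", "q", "q")}
```

For this sample graph, `relational_semantics("q", TERMINAL_RULES, BINARY_RULES, FAMILY)` contains the pair (ann, cat) even though there is no direct edge between the two, but it does not reveal that the connection runs via bob; that missing information is what the path-based semantics are meant to provide.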

4 Answering queries using all-path query semantics

If a context-free path query is evaluated on cyclic graphs, then the query result 
can be an infinite set of paths. Hence, before we look into how to answer a query 
using the all-path query semantics, we need to determine how to represent such 
an infinite set of paths using a finite structure. Graphs are strongly related to 
finite automata and it is well-known that the intersection of the language of a 
finite automaton and the language of a context-free grammar is itself a language 
that can be represented by a context-free grammar: 

Lemma 1 (Bar-Hillel et al. [3]). Let $C = (N, \Sigma, P)$ be a context-free grammar
with $a \in N$ and let $G = (Q, \Sigma, \delta)$ be a graph with $m, n \in Q$. The
language $\mathcal{L}(C; a) \cap \mathcal{L}(G; m, n)$ can be represented by a
context-free grammar.

Lemma 1 only guarantees that there is a finite representation of the set of
all traces of paths $m \pi n$ in graph $G$ with
$\mathrm{trace}(\pi) \in \mathcal{L}(C; a)$. As several paths
can have the same trace, the set of traces cannot be directly mapped to a set
of paths. To allow for a direct representation of the set of paths, we show how
to construct a context-free grammar that is annotated with node-information
relevant for the derivation of paths.

Figure 1: A social network in which persons (represented by nodes) can have
friendOf-relations (represented by labeled edges).

Definition 3. Let $C = (N, \Sigma, P)$ be a context-free grammar and let
$G = (Q, \Sigma, \delta)$ be a graph. We denote triples
$(a, m, n) \in N \times Q^2$ by $a[m, n]$. An annotated grammar over $(C, G)$ is a
context-free grammar $C_G = (N_G, \Sigma, P_G)$ in which
$N_G \subseteq N \times Q^2$; each production rule in $P_G$ is of the form
$a[m, n] \mapsto b[m, o]\, c[o, n]$ or $a[m, n] \mapsto \sigma$, with
$a, b, c \in N$, $m, n, o \in Q$, and $\sigma \in \Sigma$; and
that satisfies the following three properties:

1. $a[m, n] \in N_G$ if and only if $\mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$,

2. $(a[m, n] \mapsto b[m, o]\, c[o, n]) \in P_G$ if and only if $(a \mapsto b\, c) \in P$,

3. $(a[m, n] \mapsto \sigma) \in P_G$ if and only if $(m, \sigma, n) \in \delta$ and $(a \mapsto \sigma) \in P$.

We say that a non-terminal $a[m, n] \in N_G$ can derive path
$\pi = n_1 \sigma_1 \ldots \sigma_{i-1} n_i$ if it can derive the string
$S = \sigma_1 \ldots \sigma_{i-1}$ such that, for each $1 \le j < i$, the rewrite
step producing $\sigma_j$ used a production rule of the form
$(b[n_j, n_{j+1}] \mapsto \sigma_j) \in P_G$.

We illustrate the concept of an annotated grammar with an example:

Example 3. Let $G$ be the social network visualized in Figure 1, in which the
nodes represent people and the edges represent friendOf relations. Alice wants
to know how she can contact Eve via friends, via friends of friends, and so on.
Hence, she writes a context-free grammar $C$ with the following production rules
$P$:

$q \mapsto \mathit{friendOf}$, $\quad q \mapsto q\, q$.


For brevity, we refer to each person by the first letter of their name. The
annotated grammar over $(C, G)$ has the following non-terminals:

$q[A, B]$, $q[A, C]$, $q[A, D]$, $q[A, E]$, $q[B, D]$, $q[B, E]$, $q[C, E]$, $q[D, E]$.

The production rules $P_G$ of the annotated grammar consist of the production
rules that correspond to friendOf-edges in the social network:

$q[A, B] \mapsto \mathit{friendOf}$, $\quad q[A, C] \mapsto \mathit{friendOf}$,
$q[B, D] \mapsto \mathit{friendOf}$, $\quad q[C, E] \mapsto \mathit{friendOf}$,
$q[D, E] \mapsto \mathit{friendOf}$.
Furthermore, the following production rules of the annotated grammar express
the combination of paths in the social network to form bigger paths:

$q[A, D] \mapsto q[A, B]\, q[B, D]$, $\quad q[A, E] \mapsto q[A, B]\, q[B, E]$,
$q[A, E] \mapsto q[A, C]\, q[C, E]$, $\quad q[A, E] \mapsto q[A, D]\, q[D, E]$,
$q[B, E] \mapsto q[B, D]\, q[D, E]$.

To produce a path from Alice to Eve, we can use this annotated grammar:

q[Alice, Eve]
    $\to_{P_G}^*$ {Rewrite q[Alice, Eve] $\mapsto$ q[Alice, Bob] q[Bob, Eve]}
q[Alice, Bob] q[Bob, Eve]
    $\to_{P_G}^*$ {Rewrite q[Bob, Eve] $\mapsto$ q[Bob, Dan] q[Dan, Eve]}
q[Alice, Bob] q[Bob, Dan] q[Dan, Eve]
    $\to_{P_G}^*$ {Rewrite q[Alice, Bob] $\mapsto$ friendOf, ...}
friendOf friendOf friendOf.

The node-information in each annotated non-terminal allows us to conclude that
there is a path from Alice to Eve of length three, namely the path

Alice friendOf Bob friendOf Dan friendOf Eve.


As Example 3 shows, an annotated non-terminal in an annotated grammar
describes the mapping between the trace that is derived by the non-terminal
and the first and last node of a path having this trace.

Proposition 1. Let $C = (N, \Sigma, P)$ be a context-free grammar, let
$G = (Q, \Sigma, \delta)$ be a graph, let $C_G = (N_G, \Sigma, P_G)$ be the
annotated grammar over $(C, G)$, let $m \pi n$ be a path in $G$, and let
$a \in N$ be a non-terminal. We have $\mathrm{trace}(\pi) \in \mathcal{L}(C; a)$ if
and only if we can derive $\pi$ from $a[m, n] \in N_G$.
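
To make the annotated grammar of Example 3 concrete, the following Python sketch builds $N_G$ and $P_G$ for a small graph. It is a naive illustration, not the construction analyzed in this paper: $N_G$ is obtained by a simple fixpoint, and the edge set `SOCIAL` is an assumption inferred from the production rules listed in Example 3 (the figure itself is not reproduced here). All function and variable names are hypothetical.

```python
from itertools import product


def annotated_grammar(terminal_rules, binary_rules, edges):
    """Return (n_g, p_g_terminal, p_g_binary) for the annotated grammar.

    terminal_rules: set of (a, sigma) for rules a -> sigma
    binary_rules:   set of (a, b, c) for rules a -> b c
    edges:          set of (m, sigma, n) labeled edges
    Annotated non-terminals a[m, n] are encoded as tuples (a, m, n).
    """
    # Annotated terminal rules: a[m, n] -> sigma for every matching edge.
    p_g_terminal = {
        ((a, m, n), sigma)
        for (a, sigma) in terminal_rules
        for (m, label, n) in edges
        if label == sigma
    }
    # N_G by naive fixpoint: a[m, n] exists if b[m, o] and c[o, n] exist.
    n_g = {head for (head, _) in p_g_terminal}
    changed = True
    while changed:
        changed = False
        for (a, b, c) in binary_rules:
            # Iterate over a snapshot, since n_g grows inside the loop.
            for (b2, m, o), (c2, o2, n) in product(list(n_g), repeat=2):
                if b2 == b and c2 == c and o == o2 and (a, m, n) not in n_g:
                    n_g.add((a, m, n))
                    changed = True
    # Annotated binary rules over the non-terminals in N_G.
    p_g_binary = {
        ((a, m, n), (b, m, o), (c, o, n))
        for (a, b, c) in binary_rules
        for (b2, m, o) in n_g if b2 == b
        for (c2, o2, n) in n_g if c2 == c and o2 == o
        if (a, m, n) in n_g
    }
    return n_g, p_g_terminal, p_g_binary


# Assumed edges of the social network (A=Alice, B=Bob, C=Craig, D=Dan, E=Eve).
SOCIAL = {
    ("A", "friendOf", "B"), ("A", "friendOf", "C"), ("B", "friendOf", "D"),
    ("C", "friendOf", "E"), ("D", "friendOf", "E"),
}
N_G, P_G_TERMINAL, P_G_BINARY = annotated_grammar(
    {("q", "friendOf")}, {("q", "q", "q")}, SOCIAL
)
```

Under these assumptions, the sketch yields the eight non-terminals and ten production rules of Example 3, including three distinct rules with head q[A, E], one for each intermediate node B, C, and D.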

As annotated grammars are context-free grammars, one can use existing
context-free enumeration techniques to produce some of the
paths represented by the annotated grammar. The efficiency of these techniques
depends on the size of the annotated grammar. We observe the following
worst-case upper bounds:

Lemma 2. Let $C = (N, \Sigma, P)$ be a context-free grammar, let
$G = (Q, \Sigma, \delta)$ be a graph, and let $C_G = (N_G, \Sigma, P_G)$ be the
annotated grammar over $(C, G)$. We have $|N_G| \le |N||Q|^2$ and
$|P_G| \le |P||Q|^3 + \min(|N|, |P|)|\delta|$.

We propose annotated grammars to represent the query result of a query using
the all-path query semantics. Hence, we also need to show how to construct
such an annotated grammar, which we do next.

Theorem 1. Let $C = (N, \Sigma, P)$ be a context-free grammar and let
$G = (Q, \Sigma, \delta)$ be a graph. We can construct the annotated grammar over
$(C, G)$ in $O(|N||\delta| + (|N||Q|)^3)$.

Proof. We use the context-free recognizer for graphs of Hellings to construct
the set $N_G = \{a[m, n] \mid \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset\}$
in $O(|N||\delta| + (|N||Q|)^3)$. Using $N_G$, we can construct

$$P_G = \{(a[m, n] \mapsto \sigma) \mid (a \mapsto \sigma) \in P \land (m, \sigma, n) \in \delta\} \cup \{(a[m, n] \mapsto b[m, o]\, c[o, n]) \mid (a \mapsto b\, c) \in P \land a[m, n], b[m, o], c[o, n] \in N_G\}.$$

For the construction of $P_G$, we represent $N_G$ by a 3-dimensional boolean matrix
and the set of production rules $P$ by two look-up structures (one for rules of
the form $a \mapsto \sigma$ and one for rules of the form $a \mapsto b\, c$). These
structures guarantee constant-time lookups for all the parts used in the
definition of $P_G$. By the worst-case upper bounds on $|P_G|$, we conclude that
we can construct $P_G$ in $O(|N||\delta| + (|N||Q|)^3)$. □

Observe that Theorem 1 also proves Lemma 1 and it does so by a direct
context-free grammar construction. Usually, Lemma 1 is proven indirectly by
using a pushdown automaton-based construction [23]. Our direct approach to
proving Lemma 1 is essential; indeed, it is the direct construction of a
context-free grammar in our proof that allows us to guarantee the structural
properties in the result that we need for the derivation of paths.

5 Answering queries using single-path query semantics

Although querying for all paths can be useful in certain cases, it is often sufficient 
if the query answer contains a single such path: a single path is much easier to 
comprehend by end users and can already reveal the information end users are 
looking for. As the length of these paths is not necessarily upper bounded, a 
logical choice would be to choose a path that is as short as possible. Preferring 
such a path of minimal length over longer paths can provide additional practical 
value, as the following example illustrates. 

Example 4. Recall Example 3. If Alice used the query q to find out how she
can get in contact with Eve via friends, friends of friends, and so on, then
she probably wants to contact Eve without contacting too many other people.
Hence, the provided answer via Bob and Dan is not optimal. The path
Alice friendOf Craig friendOf Eve is shorter, and using this path Alice can
get in contact with Eve by only contacting Craig.

Towards answering context-free path queries with a single path of minimum 
length, we proceed in two steps. In Section 5.1, we develop an approach to derive
a string of minimum length from a context-free grammar. In Section 5.2, we
apply this approach to derive strings of minimum length to annotated grammars, 
hence showing how to answer context-free path queries with a path of minimum 
length. 

5.1 Construction of strings of minimum length 

The goal of this section is to provide an approach to finding a string of minimum 
length in a language defined by a context-free grammar. 

Definition 4. If $\mathcal{L}$ is a language, then the min-length of the language
$\mathcal{L}$, denoted by $\min\|\mathcal{L}\|$, is defined by
$\min\|\mathcal{L}\| = \min\{\|S\| \mid S \in \mathcal{L}\}$.


McLean et al. [26] showed that a string of minimum length in a context-free
language can be computed. Their results do not, however, give a practical
algorithm or complexity results for deriving strings of minimum length. Towards
such a derivation algorithm, we introduce derivations using deterministic
non-recursive production rules:

Definition 5. Let $P$ be a set of production rules. We define
$\mathrm{heads}(P) = \{a \mid (a \mapsto S) \in P\}$. We define the set of
non-terminals derivable from $a$ using the production rules in $P$, denoted by
$\langle a \rangle_P$, as
$\langle a \rangle_P = \{b \mid b \in N \land \exists S_1 \exists S_2\; a \to_P^+ S_1 \cdot b \cdot S_2\}$.
A set of production rules $P$ is non-recursive if, for every
$a \in \mathrm{heads}(P)$, we have $a \notin \langle a \rangle_P$. A set of
production rules $P$ is deterministic non-recursive if it is non-recursive; if,
for every $a \in \mathrm{heads}(P)$, there exists exactly one
$(a \mapsto S) \in P$; and if $a \in \mathrm{heads}(P)$ implies that there exists a
string $S \in \Sigma^*$ such that $a \to_P^* S$.

Observe that a deterministic non-recursive set $P$ does not provide choices in
how one rewrites a non-terminal $a$ into a string. As a consequence, $P$ rewrites
each $a \in \mathrm{heads}(P)$ into a unique string $S \in \Sigma^*$. In this
setting, we define $\mathrm{string}(a; P) = S$.

Example 5. Recall Example 3. The following set of production rules in the
annotated grammar is deterministic non-recursive:

$q[A, B] \mapsto \mathit{friendOf}$, $\quad q[A, C] \mapsto \mathit{friendOf}$,
$q[B, D] \mapsto \mathit{friendOf}$, $\quad q[C, E] \mapsto \mathit{friendOf}$,
$q[D, E] \mapsto \mathit{friendOf}$, $\quad q[A, D] \mapsto q[A, B]\, q[B, D]$,
$q[A, E] \mapsto q[A, B]\, q[B, E]$, $\quad q[B, E] \mapsto q[B, D]\, q[D, E]$.

Lemma 3. Let $C = (N, \Sigma, P)$ be a context-free grammar, and let $a \in N$ be a
non-terminal with $\mathcal{L}(C; a) \neq \emptyset$. There exists a deterministic
non-recursive set $P' \subseteq P$ such that
$\|\mathrm{string}(a; P')\| = \min\|\mathcal{L}(C; a)\|$.

Proof (sketch). Let $S$ be a string with $a \to_P^+ S$ and
$\|S\| = \min\|\mathcal{L}(C; a)\|$. Consider the derivation of $S$. If the
derivation $a \to_P^+ S$ has a sequence of rewrite steps
$b \to_P^+ S_1 \cdot b \cdot S_2$, then, due to $S$ having minimum length, we must
have $S_1 = S_2 = \epsilon$. Hence, we can remove all the rewrite steps involved in
$b \to_P^+ S_1 \cdot b \cdot S_2$ and use the rewrite steps used to rewrite $b$ in
$S_1 \cdot b \cdot S_2$ instead. If the derivation $a \to_P^+ S$ uses distinct
production rules $c \mapsto S_1$ and $c \mapsto S_2$, then, due to $S$ having
minimum length, both $S_1$ and $S_2$ are rewritten into equal-length strings in
$\Sigma^*$. Hence, we can choose one of the two production rules and use it for
both rewrites of $c$; the resulting string will have the same length as $S$. □

Corollary 1. Let $C = (N, \Sigma, P)$ be a context-free grammar. There exists a
deterministic non-recursive set $P' \subseteq P$ such that for every non-terminal
$a \in N$ with $\mathcal{L}(C; a) \neq \emptyset$, we have
$\|\mathrm{string}(a; P')\| = \min\|\mathcal{L}(C; a)\|$.

We say that a deterministic non-recursive set satisfying the conditions of 
Corollary 1 is minimizing. Corollary 1 does not imply that each deterministic
non-recursive set always produces strings of minimum length. With an example, 
we show that this is not the case: 

Example 6. Recall Example 5. The provided set of deterministic non-recursive
production rules $P'$ is not minimizing. We have
$\|\mathrm{string}(q[A, E]; P')\| = 3$, while a
shorter string of length two exists (via Craig). By replacing the production rule
for q[Alice, Eve] by q[Alice, Eve] $\mapsto$ q[Alice, Craig] q[Craig, Eve], the
resulting set of production rules is minimizing.
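
Examples 5 and 6 can be replayed with a few lines of Python. The sketch below assumes a deterministic non-recursive set is stored as a dictionary from each head to its single rule body; the function name `string_of` and the string encoding of non-terminals such as `"q[A,E]"` are illustrative choices only.

```python
def string_of(a, rules):
    """Expand non-terminal a into its unique terminal string.

    rules maps each head to its single body (a tuple of symbols); any symbol
    that is not a key of rules is treated as a terminal. Termination relies on
    the rule set being non-recursive.
    """
    out = []
    for symbol in rules[a]:
        if symbol in rules:
            out.extend(string_of(symbol, rules))
        else:
            out.append(symbol)
    return out


# The deterministic non-recursive set of Example 5.
RULES_EXAMPLE_5 = {
    "q[A,B]": ("friendOf",),
    "q[A,C]": ("friendOf",),
    "q[B,D]": ("friendOf",),
    "q[C,E]": ("friendOf",),
    "q[D,E]": ("friendOf",),
    "q[A,D]": ("q[A,B]", "q[B,D]"),
    "q[A,E]": ("q[A,B]", "q[B,E]"),
    "q[B,E]": ("q[B,D]", "q[D,E]"),
}

# The replacement of Example 6: route q[A,E] via Craig instead of Bob.
RULES_EXAMPLE_6 = dict(RULES_EXAMPLE_5)
RULES_EXAMPLE_6["q[A,E]"] = ("q[A,C]", "q[C,E]")
```

With these definitions, `len(string_of("q[A,E]", RULES_EXAMPLE_5))` is 3 and `len(string_of("q[A,E]", RULES_EXAMPLE_6))` is 2, matching the lengths stated in Example 6.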

Given a minimizing set of production rules $P'$, it is straightforward to
produce a string of minimum length for each non-terminal
$a \in \mathrm{heads}(P')$ that has such a string. The worst-case complexity of
producing these strings is dominated by the length of the produced strings. By
using the restrictions put on deterministic non-recursive sets, we can provide
the following worst-case upper bounds on the length of strings of minimal length:

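Proposition 2. Let $C = (N, \Sigma, P)$ be a context-free grammar and let
$\mathcal{N} = \{a \mid a \in N \land \mathcal{L}(C; a) \neq \emptyset\}$ be the set
of non-terminals that define a non-empty language. We have
$\max_{a \in \mathcal{N}}(\min\|\mathcal{L}(C; a)\|) \le 2^{|N|-1}$ and
$\sum_{a \in \mathcal{N}}(\min\|\mathcal{L}(C; a)\|) \le 2^{|N|} - 1$.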

Proof (sketch). Let $|\mathcal{N}| = i$ and let $\sigma \in \Sigma$. We only have to
consider the case where $P$ is a deterministic non-recursive set. Hence, we can
order the non-terminals such that $\mathcal{N} = \{a_0, \ldots, a_{i-1}\}$ and we
use the production rules $a_0 \mapsto \sigma$ and $a_j \mapsto a_{j-1}\, a_{j-1}$,
for all $1 \le j \le i - 1$. □
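
The arithmetic behind this worst case can be spelled out as follows. This is a short worked derivation for the rule set named in the proof sketch, with $i$ non-terminals $a_0, \ldots, a_{i-1}$; it only restates what the doubling rules imply.

```latex
\begin{align*}
\|\mathrm{string}(a_0; P)\| &= 1 = 2^0, \\
\|\mathrm{string}(a_j; P)\| &= 2\,\|\mathrm{string}(a_{j-1}; P)\| = 2^j
    && \text{for } 1 \le j \le i - 1, \\
\max_{0 \le j \le i-1} \|\mathrm{string}(a_j; P)\| &= 2^{i-1}, \\
\sum_{j=0}^{i-1} \|\mathrm{string}(a_j; P)\| &= \sum_{j=0}^{i-1} 2^j = 2^i - 1.
\end{align*}
```

Since $i \le |N|$, both values stay within the bounds $2^{|N|-1}$ and $2^{|N|} - 1$.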

As there exists a straightforward procedure to efficiently construct strings 
of minimum length from a minimizing set of production rules, we only need a 
procedure to construct such a set of production rules for a given context-free
grammar. Algorithm 1 provides such a procedure.


Algorithm 1 Construct a minimizing set of production rules for $C = (N, \Sigma, P)$

    P', cost := empty mapping, empty mapping
    new is a min-priority queue
    for all $(a \mapsto \sigma) \in P$ do
        if $a \notin$ cost then
            cost[a], P'[a] := 1, $(a \mapsto \sigma)$
            add a to new with priority 1
    while new $\neq \emptyset$ do
        take a with minimum priority in new, and remove it from new
        for all $(c \mapsto a\, b) \in P$ with $b \in$ cost do
            PRODUCE($c \mapsto a\, b$)
        for all $(c \mapsto b\, a) \in P$ with $b \in$ cost do
            PRODUCE($c \mapsto b\, a$)
    return $\{P'[a] \mid a \in P'\}$

Procedure PRODUCE($d \mapsto e\, f$):

    if $d \notin$ cost then
        cost[d], P'[d] := cost[e] + cost[f], $(d \mapsto e\, f)$
        add d to new with priority cost[e] + cost[f]
    else if cost[d] > cost[e] + cost[f] then
        cost[d], P'[d] := cost[e] + cost[f], $(d \mapsto e\, f)$
        lower priority of d in new to cost[e] + cost[f]
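
Algorithm 1 can be sketched in Python as follows. This is an illustrative adaptation rather than a faithful transcription: Python's `heapq` has no decrease-key operation, so the sketch pushes a fresh entry on every improvement and skips stale entries when they are popped, which changes the complexity analysis. The function name `minimizing_rules` and the tuple encoding of production rules are assumptions made for this example.

```python
import heapq


def minimizing_rules(terminal_rules, binary_rules):
    """Pick one cheapest production rule per non-terminal.

    terminal_rules: list of (a, sigma) for rules a -> sigma
    binary_rules:   list of (c, a, b) for rules c -> a b
    Returns (chosen, cost): chosen[a] is the selected rule body for a, and
    cost[a] is the length of the string that a derives using chosen.
    """
    cost, chosen, queue, done = {}, {}, [], set()

    for (a, sigma) in terminal_rules:
        if a not in cost:
            cost[a], chosen[a] = 1, (sigma,)
            heapq.heappush(queue, (1, a))

    def produce(d, e, f):
        total = cost[e] + cost[f]
        if d not in cost or cost[d] > total:
            cost[d], chosen[d] = total, (e, f)
            # Stand-in for "add" / "lower priority": push a new entry.
            heapq.heappush(queue, (total, d))

    while queue:
        priority, a = heapq.heappop(queue)
        if a in done or priority != cost[a]:
            continue  # stale entry left behind by an earlier improvement
        done.add(a)
        for (c, x, y) in binary_rules:
            if x == a and y in cost:
                produce(c, x, y)
            if y == a and x in cost:
                produce(c, x, y)
    return chosen, cost


# Hypothetical grammar: a -> x, b -> a a, c -> b b, c -> a b, d -> d d.
CHOSEN, COST = minimizing_rules(
    [("a", "x")],
    [("b", "a", "a"), ("c", "b", "b"), ("c", "a", "b"), ("d", "d", "d")],
)
```

In the sample grammar, `c` can be rewritten via `b b` (length 4) or via `a b` (length 3), and the sketch keeps the cheaper rule. The non-terminal `d` defines an empty language and therefore never receives a cost.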


Proposition 3. Let $C = (N, \Sigma, P)$ be a context-free grammar. Algorithm 1
applied on $C$ produces a minimizing set of production rules for $C$.
Proof (sketch). The main while-loop maintains the following invariants:

1. If $a \in P'$ and $P'[a] = (a \mapsto b\, c)$, then $b, c \in P'$.

2. If $a \in P'$ and $P'[a] = (a \mapsto b\, c)$, then
$\mathit{cost}[a] \ge \mathit{cost}[b] + \mathit{cost}[c]$,
$\mathit{cost}[a] > \mathit{cost}[b]$, and $\mathit{cost}[a] > \mathit{cost}[c]$.

3. If $a \in P'$, then $\|\mathrm{string}(a; P')\| \le \mathit{cost}[a]$.

4. Let $m$ be the priority of the last element removed from new. No new
element is inserted in new with priority less than or equal to $m$.

5. Let $m$ be the priority of the last element removed from new. For all
$a \in N$ with $\min\|\mathcal{L}(C; a)\| \le m$, we have
$\mathit{cost}[a] = \min\|\mathcal{L}(C; a)\|$.

As each non-terminal is added to new at most once, Algorithm 1 terminates. At
termination, Invariants 1-5 guarantee that the resulting set of production rules
$\{P'[a] \mid a \in P'\}$ is minimizing. □

We observe that we cannot straightforwardly generalize Algorithm 1 to the
setting that includes production rules of the form $a \mapsto \epsilon$:
Invariants 2, 4, and 5 of the proof of Proposition 3 no longer hold (as they all
require a strict ordering of the cost of non-terminals). This issue can be
resolved by maintaining a timestamp on each non-terminal (such that a
non-terminal has timestamp $i$ if it was the $i$-th change to $P'$), and by
changing the relevant invariants to require not a strict ordering on the cost of
non-terminals, but a strict ordering on the pairs (cost, timestamp) of
non-terminals.

Theorem 2. Let $C = (N, \Sigma, P)$ be a context-free grammar. Algorithm 1
constructs a minimizing set of production rules for $C$ in
$O(|N|(|N|\log(|N|) + |P|))$.

Proof. We represent costs as an array holding $|N|$ integers. The costs used in
cost and new are integers in the range $1, \ldots, 2^{|N|}$. We can represent each
of these integers using $\log(2^{|N|}) = |N|$ bits. The initialization steps
perform $O(|P|)$ steps. The while-loop will, in the worst case, visit every
non-terminal once. For each of these non-terminals, one insertion into and one
removal from the priority queue new is performed. The inner for-loops will visit
every production rule twice, causing at most $2|P|$ decrease-key operations on
priority queue new. When using a Fibonacci heap for a priority queue holding at
most $e$ elements, each insert and removal costs $O(\log(e))$ and each
decrease-key operation costs an amortized $O(1)$ heap operations. Hence, a total
of $O(|N|\log(|N|) + |P|)$ heap operations are performed. Taking the size of the
integers representing priorities into account, the heap operations cost
$O(|N|(|N|\log(|N|) + |P|))$. □

Using Theorem 2 and Proposition 2, we conclude the following:

Corollary 2. Let $C = (N, \Sigma, P)$ be a context-free grammar, let
$\mathcal{N} = \{a \mid a \in N \land \mathcal{L}(C; a) \neq \emptyset\}$ be the set
of non-terminals that define a non-empty language, and let
$L = \sum_{a \in \mathcal{N}}(\min\|\mathcal{L}(C; a)\|)$ be the combined length of
a string of minimum length for each non-terminal in $\mathcal{N}$. We can
construct strings of minimum length for all non-terminals in $\mathcal{N}$ in
$O(|N|(|N|\log(|N|) + |P|) + L) = O(|N|(|N|\log(|N|) + |P|) + 2^{|N|})$.
5.2 Construction of paths of minimum length

We can already answer queries with single paths of minimum length by first
constructing an annotated grammar and then applying Algorithm 1. This approach
has high overhead due to the explicit construction and storing of the
annotated grammar. To reduce this overhead, we adapt Algorithm 1 to the
setting of query evaluation using the single-path query semantics. The resulting
algorithm, Algorithm 2, operates on a normal context-free grammar and a
graph, and derives the necessary details of the annotated grammar in place. If
necessary, Algorithm 2 can use straightforward bookkeeping to also construct
$N_G$ and $P_G$, without increasing the asymptotic complexity of the algorithm.


Algorithm 2 Construct a minimizing set of production rules for the annotated
grammar over C = (N, Σ, P) and G = (Q, Σ, δ)

 1: P′, cost := empty mapping, empty mapping
 2: new is a min-priority queue
 3: for all (a ↦ σ) ∈ P and (m, σ, n) ∈ δ do
 4:     if a[m, n] ∉ cost then
 5:         cost[a[m, n]], P′[a[m, n]] := 1, (a[m, n] ↦ σ)
 6:         add a[m, n] to new with priority 1
 7: while new ≠ ∅ do
 8:     take a[m, n] with minimum priority in new, and remove it from new
 9:     for all (c ↦ a b) ∈ P with b[n, o] ∈ cost do
10:         PRODUCE(c[m, o] ↦ a[m, n] b[n, o])
11:     for all (c ↦ b a) ∈ P with b[o, m] ∈ cost do
12:         PRODUCE(c[o, n] ↦ b[o, m] a[m, n])
13: return {P′[a[m, n]] | a[m, n] ∈ P′}

Procedure PRODUCE(d[u, w] ↦ e[u, v] f[v, w])

14: if d[u, w] ∉ cost then
15:     cost[d[u, w]], P′[d[u, w]] := cost[e[u, v]] + cost[f[v, w]],
            (d[u, w] ↦ e[u, v] f[v, w])
16:     add d[u, w] to new with priority cost[e[u, v]] + cost[f[v, w]]
17: else if cost[d[u, w]] > cost[e[u, v]] + cost[f[v, w]] then
18:     cost[d[u, w]], P′[d[u, w]] := cost[e[u, v]] + cost[f[v, w]],
            (d[u, w] ↦ e[u, v] f[v, w])
19:     lower priority of d[u, w] in new to cost[e[u, v]] + cost[f[v, w]]
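The following Python sketch shows one way Algorithm 2 could be realized. It is
an illustration under stated assumptions, not the implementation used for the
experiments: the function name `minimizing_rules`, the tuple encoding
`(a, m, n)` for an annotated non-terminal a[m, n], and the rule tags `"edge"`
and `"pair"` are hypothetical. Python's `heapq` has no operation for lowering
a priority, so the sketch swaps in lazy deletion: it pushes a fresh entry and
skips stale entries when they are popped.

```python
import heapq
from collections import defaultdict


def minimizing_rules(terminal_rules, binary_rules, edges):
    """Sketch of Algorithm 2 for a grammar with rules a -> sigma and c -> a b.

    terminal_rules: iterable of (a, sigma); binary_rules: iterable of (c, a, b);
    edges: iterable of (m, sigma, n). Node ids must be mutually comparable.
    Returns (rules, cost), both keyed by annotated non-terminals (a, m, n).
    """
    cost, rules, heap = {}, {}, []
    starts = defaultdict(set)  # (b, n) -> every o with b[n, o] in cost
    ends = defaultdict(set)    # (b, m) -> every o with b[o, m] in cost
    left = defaultdict(list)   # a -> (c, b) for every rule c -> a b
    right = defaultdict(list)  # a -> (c, b) for every rule c -> b a
    for c, a, b in binary_rules:
        left[a].append((c, b))
        right[b].append((c, a))

    def record(key, new_cost, rule):
        # Store the cheapest rule known so far and (re)queue the key.
        name, u, w = key
        starts[(name, u)].add(w)
        ends[(name, w)].add(u)
        cost[key], rules[key] = new_cost, rule
        heapq.heappush(heap, (new_cost, key))

    def produce(d, e, f):
        # Procedure PRODUCE: keep d -> e f only if it is new or cheaper.
        new_cost = cost[e] + cost[f]
        if d not in cost or cost[d] > new_cost:
            record(d, new_cost, ("pair", e, f))

    by_label = defaultdict(list)
    for m, sigma, n in edges:
        by_label[sigma].append((m, n))
    for a, sigma in terminal_rules:
        for m, n in by_label[sigma]:
            if (a, m, n) not in cost:
                record((a, m, n), 1, ("edge", sigma))

    while heap:
        priority, (a, m, n) = heapq.heappop(heap)
        if priority > cost[(a, m, n)]:
            continue  # stale entry: a cheaper rule was recorded later
        for c, b in left[a]:
            for o in list(starts[(b, n)]):
                produce((c, m, o), (a, m, n), (b, n, o))
        for c, b in right[a]:
            for o in list(ends[(b, m)]):
                produce((c, o, n), (b, o, m), (a, m, n))
    return rules, cost


# Usage: transitive closure q -> a q | s, a -> s on a cycle over three nodes.
cycle = [(0, "s", 1), (1, "s", 2), (2, "s", 0)]
rules, cost = minimizing_rules([("q", "s"), ("a", "s")], [("q", "a", "q")], cycle)
print(cost[("q", 0, 0)])  # length of a shortest non-empty path from 0 to 0
```

The two indexes `starts` and `ends` play the role of the tests "b[n, o] ∈ cost"
and "b[o, m] ∈ cost"; a real implementation may prefer a different layout.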


Before we fully analyze Algorithm 2, we use Lemma 2 and Proposition 2 to
conclude the following naive worst-case upper bound on the length of paths of
minimum length:

Corollary 3. Let $C = (N, \Sigma, P)$ be a context-free grammar with $a \in N$ and let
$G = (Q, \Sigma, \delta)$ be a graph with $m, n \in Q$, such that
$\mathcal{L} = \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$.
We have $\min \|\mathcal{L}\| \le$ .

Using Corollary 3, we conclude the following:

Proposition 4. Let $C = (N, \Sigma, P)$ be a context-free grammar and let
$G = (Q, \Sigma, \delta)$ be a graph. Algorithm 2 constructs a minimizing set of
production rules for the annotated grammar $C_G = (N_G, \Sigma, P_G)$ over $(C, G)$ in
$O(|N_G|(|N_G| \log(|N_G|) + |P_G|))$, and, hence, in
$O(|N||Q|^2((|N||Q|^2) \log(|N||Q|^2) + |P||Q|^3 + \min(|N|, |P|)|\delta|))$.

Corollary 4. Let $C = (N, \Sigma, P)$ be a context-free grammar with $a \in N$ and let
$G = (Q, \Sigma, \delta)$ be a graph with $m, n \in Q$, such that
$\mathcal{L} = \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$.
We can construct a path $m \pi n$ such that $\mathrm{trace}(\pi) \in \mathcal{L}$ and
$\|\mathrm{trace}(\pi)\| = \min \|\mathcal{L}\|$ in

$O(|N||Q|^2((|N||Q|^2) \log(|N||Q|^2) + |P||Q|^3 + \min(|N|, |P|)|\delta|) + \|\mathrm{trace}(\pi)\|) =$
$O(|N||Q|^2((|N||Q|^2) \log(|N||Q|^2) + |P||Q|^3 + \min(|N|, |P|)|\delta|) +$
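The additive term $\|\mathrm{trace}(\pi)\|$ in Corollary 4 is the cost of unfolding
the minimizing set into an explicit path. A minimal Python sketch of that
unfolding is given below. The function name `derive_path` and the rule
encoding (`("edge", label)` for a terminal rule, `("pair", left, right)` for a
binary rule, keys of the form `(a, m, n)`) are assumptions made for this
illustration; the small rule set is written out by hand for a cycle over
three nodes.

```python
def derive_path(rules, start):
    """Unfold an annotated non-terminal into the edges of its path.

    An explicit stack is used instead of recursion, because paths of
    minimum length can be very long.
    """
    path, stack = [], [start]
    while stack:
        key = stack.pop()
        rule = rules[key]
        if rule[0] == "edge":
            _, m, n = key
            path.append((m, rule[1], n))
        else:
            _, left, right = rule
            stack.append(right)  # pushed first, so it is expanded last
            stack.append(left)
    return path


# Hand-written minimizing rules for q -> a q | s, a -> s on the cycle 0-1-2-0.
rules = {
    ("q", 0, 0): ("pair", ("a", 0, 1), ("q", 1, 0)),
    ("q", 1, 0): ("pair", ("a", 1, 2), ("q", 2, 0)),
    ("q", 2, 0): ("edge", "s"),
    ("a", 0, 1): ("edge", "s"),
    ("a", 1, 2): ("edge", "s"),
}
path = derive_path(rules, ("q", 0, 0))
print(path)
```

Every binary rule adds one step and every terminal rule emits one edge, so the
number of loop iterations is linear in the length of the produced path.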

Observe that the upper bound of Corollary 3 is very loose: Proposition 2
depends on production rules of the form $a \mapsto b\, b$, whereas, in general, annotated
grammars only allow for such production rules in very restricted cases. Hence,
we look at ways to improve the worst-case upper bound observed by Corollary 3.
As an initial step, we consider languages defined over singleton alphabets:

Proposition 5. Let $\Sigma$ be an alphabet with $|\Sigma| = 1$, let $C = (N, \Sigma, P)$ be
a context-free grammar with $a \in N$, and let $G = (Q, \Sigma, \delta)$ be a graph with
$m, n \in Q$, such that $\mathcal{L} = \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$. In the worst case, we have
$|Q| 2^{|N|-1} \le \min \|\mathcal{L}\| \le |Q|(2^{2|N|-1} + 1)$.

Proof. First, we prove the lower bound. Let $\Sigma = \{\sigma\}$, let $Q = \{n_0, \dots, n_{|Q|-1}\}$,
let $\delta = \{(n_i, \sigma, n_{i+1 \bmod |Q|}) \mid 0 \le i \le |Q| - 1\}$, let $N = \{a_0, \dots, a_{|N|-1}\}$, and
let $P = \{a_0 \mapsto \sigma, a_0 \mapsto a_0\, a_0\} \cup \{a_i \mapsto a_{i-1}\, a_{i-1} \mid 1 \le i \le |N| - 1\}$.

With these definitions we have $\mathcal{L}(G; n_i, n_i) = \{\sigma^{k|Q|} \mid 0 \le k\}$, for every
$0 \le i \le |Q| - 1$, and we have $\mathcal{L}(C; a_0) = \{\sigma^k \mid 1 \le k\}$. Hence, the string $S_0$ of
minimum length that can be derived from $a_0[n_i, n_i]$ is $S_0 = \sigma^{|Q|}$. Each $a_j$,
$1 \le j \le |N| - 1$, will be rewritten into a string of exactly $2^j$ $a_0$ non-terminals.
Hence, the string $S_j$ of minimum length that can be derived from $a_j[n_i, n_i]$ is
$S_j = \sigma^{|Q| 2^j}$, and we conclude

$\|S_{|N|-1}\| = |Q| 2^{|N|-1}$.

The upper bound is proven using a result of Pighizzini et al.: for each
context-free grammar $C$ with $|N|$ non-terminals, there exists a finite automaton
$G' = (Q', \Sigma, \delta')$ with initial state $m$, final states $F$, and with $|Q'| \le 2^{2|N|-1} + 1$
such that $\bigcup_{n \in F} \mathcal{L}(G'; m, n) = \mathcal{L}(C; a)$. We use $G'$ to represent $C$ and we apply
the well-known product construction for the intersection of finite automata on
$G'$ and $G$. The resulting finite automaton has $|Q||Q'| \le |Q|(2^{2|N|-1} + 1)$ states,
proving the upper bound. □
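The last step of the proof relies on two standard facts: the product automaton
has at most one state per pair of states, and a shortest accepted string never
visits a state twice, so its length stays below the number of states. The
Python sketch below illustrates both facts on a toy instance. The automaton
`automaton_edges` (accepting $\sigma^k$ for every even $k \ge 2$) and the helper names
`product` and `shortest_accepted_length` are assumptions made for this
illustration and do not come from the construction of Pighizzini et al.

```python
from collections import deque


def product(edges_a, edges_b):
    """Product construction: pair up transitions that read the same label."""
    return {((p, q), s, (p2, q2))
            for (p, s, p2) in edges_a
            for (q, t, q2) in edges_b
            if s == t}


def shortest_accepted_length(edges, start, finals):
    """Breadth-first search for the length of a shortest accepted string."""
    successors = {}
    for p, _, q in edges:
        successors.setdefault(p, set()).add(q)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p in finals:
            return distance[p]
        for q in successors.get(p, ()):
            if q not in distance:
                distance[q] = distance[p] + 1
                queue.append(q)
    return None  # the intersection is empty


# Graph G: a cycle over three nodes, as in the lower-bound construction.
graph_edges = [(0, "s", 1), (1, "s", 2), (2, "s", 0)]
# Hypothetical automaton G': accepts s^k for every even k >= 2.
automaton_edges = [("x0", "s", "x1"), ("x1", "s", "x2"), ("x2", "s", "x1")]

both = product(graph_edges, automaton_edges)
states = {p for p, _, _ in both} | {q for _, _, q in both}
length = shortest_accepted_length(both, (0, "x0"), {(0, "x2")})
print(len(states), length)
```

Here the shortest string that is both a round trip through node 0 and of even
positive length is found by a plain search over the product states.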

Due to Proposition 5, we can conclude that in the case of unlabeled graphs,
the complexity of query evaluation with the single-path query semantics is
polynomial in terms of the graph size and exponential in terms of the query size.
Observe, however, that the exponential complexity in terms of the query size
follows straightforwardly from the succinctness of context-free grammars (as
compared to regular expressions and finite automata).

In the labeled case, we can still use Proposition 5 to get worst-case lower
bounds on $\min \|\mathcal{L}(C; a) \cap \mathcal{L}(G; m, n)\|$. The worst-case upper bound provided
by Proposition 5 cannot, however, be generalized to arbitrary alphabets, which
we show next.
Figure 2: The double-cyclic graph: two cycles, one having $u$ edges labeled with
$\sigma_1$, and one having $v$ edges labeled with $\sigma_2$. The two cycles are connected via
a shared node $c$.


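Proposition 6. Let $C = (N, \Sigma, P)$ be a context-free grammar with $a \in N$, and
let $G = (Q, \Sigma, \delta)$ be a graph with $m, n \in Q$, such that
$\mathcal{L} = \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$. In the worst case, we have
$|Q|^2 2^{|N|} \le 64 \min \|\mathcal{L}\|$.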

Proof. Choose the well-known context-free language $\mathcal{L} = \{\sigma_1^k \sigma_2^k \mid 1 \le k\}$. The
context-free grammar $C = (N, \Sigma, P)$ with $N = \{a, a', a_1, a_2, b_1, \dots, b_{|N|-4}\}$ and

$P = \{a \mapsto a_1\, a_2, a \mapsto a_1\, a', a' \mapsto a\, a_2, a_1 \mapsto \sigma_1, a_2 \mapsto \sigma_2\} \cup$
$\{b_1 \mapsto a\, a\} \cup \{b_j \mapsto b_{j-1}\, b_{j-1} \mid 1 < j \le |N| - 4\}$

has $\mathcal{L}(C; a) = \mathcal{L}$. Choose a $k$ with $1 \le k$ and choose $|Q| = u + v - 1$ with
$u = 2^k + 1$ and $v = u - 1$. Let $G = (Q, \Sigma, \delta)$ with
$Q = \{c, m_1, \dots, m_{u-1}, n_1, \dots, n_{v-1}\}$ and

$\delta = \{(c, \sigma_1, m_1), (m_{u-1}, \sigma_1, c)\} \cup \{(m_i, \sigma_1, m_{i+1}) \mid 1 \le i < u - 1\} \cup$
$\{(c, \sigma_2, n_1), (n_{v-1}, \sigma_2, c)\} \cup \{(n_i, \sigma_2, n_{i+1}) \mid 1 \le i < v - 1\}$.

The resulting graph is visualized in Figure 2.

Let $c \pi c$ be a path in $G$ with $\mathrm{trace}(\pi) \in \mathcal{L}(C; a)$. Due to the definition of $a$,
we must have $\mathrm{trace}(\pi) = S_1 \cdot S_2$ with $S_1 = \sigma_1^x$ and $S_2 = \sigma_2^x$, for $1 \le x$. Due
to the structure of the graph, $S_1$ must be the trace of a path $c \pi_1 c$ and $S_2$ must
be the trace of a path $c \pi_2 c$ in graph $G$. From these constraints, we conclude
$\mathcal{L}(C; a) \cap \mathcal{L}(G; c, c) = \{\sigma_1^{k\,\mathrm{lcm}(u,v)} \sigma_2^{k\,\mathrm{lcm}(u,v)} \mid 1 \le k\}$. Observe that $v = 2^k$ and
$u = 2^k + 1$ are coprime; hence, we have $\mathrm{lcm}(u, v) = uv$.

Each $b_j$, $1 \le j \le |N| - 4$, will be rewritten into a string of exactly $2^j$ $a$
non-terminals. As only node $c$ has outgoing edges labeled with both $\sigma_1$ and $\sigma_2$, each
$b_j[c, c]$ will be rewritten into a string of exactly $2^j$ $a[c, c]$ non-terminals. Hence,
we conclude $\min \|\mathcal{L}(C; b_j) \cap \mathcal{L}(G; c, c)\| = uv2^j$, and we conclude

$$\min \|\mathcal{L}(C; b_{|N|-4}) \cap \mathcal{L}(G; c, c)\| = uv2^{|N|-4} \ge v^2 \frac{2^{|N|}}{16} = \left(\frac{|Q|}{2}\right)^2 \frac{2^{|N|}}{16} = \frac{|Q|^2 2^{|N|}}{64}.$$

□
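The coprimality argument in this proof can be checked on small instances. The
Python sketch below builds the double-cyclic graph as a successor table and
walks both cycles in lockstep until both walks are back at the shared node,
which yields the smallest exponent $x \ge 1$ for which $\sigma_1^x \sigma_2^x$ is the trace of a
path from $c$ to $c$. The function names and the labels `"s1"` and `"s2"` are
hypothetical names chosen for this illustration.

```python
def double_cyclic(u, v):
    """Successor table of the double-cyclic graph with cycle lengths u and v."""
    ring1 = ["c"] + [f"m{i}" for i in range(1, u)]  # u edges labeled s1
    ring2 = ["c"] + [f"n{i}" for i in range(1, v)]  # v edges labeled s2
    step = {}
    for label, ring in (("s1", ring1), ("s2", ring2)):
        for i, node in enumerate(ring):
            step[(node, label)] = ring[(i + 1) % len(ring)]
    return step


def smallest_balanced_exponent(u, v):
    """Smallest x >= 1 such that s1^x s2^x is the trace of a path from c to c."""
    step = double_cyclic(u, v)
    on_first, on_second, x = "c", "c", 0
    while True:
        x += 1
        on_first = step[(on_first, "s1")]     # x edges along the first cycle
        on_second = step[(on_second, "s2")]   # x edges along the second cycle
        if on_first == "c" and on_second == "c":
            return x


for k in range(1, 5):
    u = 2 ** k + 1
    v = u - 1
    print(u, v, smallest_balanced_exponent(u, v))
```

For each tested pair the reported exponent equals $uv$, in line with
$\mathrm{lcm}(u, v) = uv$ for coprime $u$ and $v$.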


For labeled graphs, we do not yet have a better worst-case upper bound than
the naive upper bound provided by Corollary 3.

Open Problem 1. Let $C = (N, \Sigma, P)$ be a context-free grammar with $a \in N$, and
let $G = (Q, \Sigma, \delta)$ be a graph with $m, n \in Q$, such that
$\mathcal{L} = \mathcal{L}(C; a) \cap \mathcal{L}(G; m, n) \neq \emptyset$.
What is the strict worst-case upper bound on $\min \|\mathcal{L}\|$? Or, equivalently, what
is the strict worst-case upper bound on the length of a shortest string in the
intersection of the language of a context-free grammar with $x$ non-terminals and
the language of a finite automaton with $y$ states?

We conjecture that, as in the unlabeled case, the complexity of query
evaluation with the single-path query semantics is polynomial in terms of the
graph size and exponential in terms of the query size.


6 Experimental results 

To provide insight into the practical behavior of path-based query evaluation, we
have implemented algorithms for the evaluation of queries using the single-path
query semantics.¹ We primarily focus on the running time of Algorithm 2, as
the cost of producing the paths of interest heavily depends on whether one wants
to produce a path for a particular node pair or for all node pairs. We perform
three different tests:

1. We compare two context-free grammars that both evaluate to the positive
transitive closure (under the relational query semantics):

$q_1 \mapsto a\, q_1$, $q_1 \mapsto \sigma$, $a \mapsto \sigma$;

$q_2 \mapsto q_2\, q_2$, $q_2 \mapsto \sigma$.

Observe that the context-free grammar with non-terminal $q_1$ is linear and
non-ambiguous, whereas the context-free grammar with non-terminal $q_2$ is
non-linear and highly ambiguous. We measure the running time of constructing
minimizing sets of annotated production rules for these queries on the cyclic
graphs of Proposition 5.

2. We compare the context-free grammar with non-terminal $q_1$, which produces
dense result sets, with the language $\mathcal{L} = \{\sigma\sigma\sigma\}$, which produces sparse result
sets. We measure the running time of constructing minimizing sets of annotated
production rules for these queries on the cyclic graphs of Proposition 5.

3. We derive the longest path of minimum length that matches the following
context-free grammar:

$q \mapsto a\, q'$, $q' \mapsto q\, b$, $q \mapsto a\, b$,

$a \mapsto \sigma_1$, $b \mapsto \sigma_2$.

Similar context-free grammars are used in Example 1 and Proposition 6.
We evaluate these queries on the double-cyclic graphs of Proposition 6, for
which we know that the query $q$ will produce paths with a high minimum
length. We measure both the running time of constructing a minimizing
set of annotated production rules for $q$ on the double-cyclic graphs, and
the running time for deriving the longest path of minimum length from
the resulting minimizing set.
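The difference in ambiguity between the two grammars of the first test can be
made concrete by counting parse trees. Over a one-letter alphabet there is
exactly one string per length, so the number of parse trees per length
measures ambiguity directly. The Python sketch below does this count with a
simple dynamic program. The function name `count_parse_trees` and the tuple
encoding of the rules are assumptions made for this illustration; the two rule
sets encode the grammars for $q_1$ and $q_2$ as read from the first test.

```python
def count_parse_trees(terminal_rules, binary_rules, start, max_len):
    """Number of parse trees per string length, for lengths 1..max_len.

    terminal_rules: list of (a, sigma) for rules a -> sigma
    binary_rules:   list of (c, a, b)  for rules c -> a b
    Assumes a one-letter alphabet, so each length names a single string.
    """
    names = {a for a, _ in terminal_rules}
    names |= {x for rule in binary_rules for x in rule}
    trees = {x: [0] * (max_len + 1) for x in names}
    for a, _ in terminal_rules:
        trees[a][1] += 1
    for n in range(2, max_len + 1):
        for c, a, b in binary_rules:
            # Split the string of length n into a prefix and a suffix.
            trees[c][n] += sum(trees[a][i] * trees[b][n - i] for i in range(1, n))
    return trees[start][1:]


# q1 -> a q1 | s, a -> s (linear) versus q2 -> q2 q2 | s (non-linear).
linear = count_parse_trees([("q1", "s"), ("a", "s")], [("q1", "a", "q1")], "q1", 5)
doubling = count_parse_trees([("q2", "s")], [("q2", "q2", "q2")], "q2", 5)
print(linear)
print(doubling)
```

The count for $q_1$ stays at one tree per length, while the count for $q_2$ grows
with the Catalan numbers, which is consistent with $q_2$ offering many more
candidate derivations per annotated non-terminal.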

¹ The algorithms are implemented in C++. Measurements were performed on a system
with an Intel Core i5-4670 CPU, running at a maximum of 3.8GHz, and with 16GB of main
memory. The source code will be made available under an open-source license.
Figure 3: Results of test measurements on Algorithm 2 and on deriving paths
of minimum length. The horizontal axis displays the size of the graph ($|Q|$), the
vertical axis displays the running time (s), and the plot titles match the
test descriptions in Section 6.


We remark that these tests illustrate extreme behavior: the queries apply to 
the entirety of the graph and all nodes and edges will participate in the outcome. 
This does not reflect all practical applications, where one can often expect that 
queries are much more selective. 

The measurements for these three tests are summarized in Figure 3. On the
one hand, we see that Algorithm 2 can evaluate queries on large graphs, even if
the resulting paths are long or if many paths are produced. For example, query
$q$ evaluated on double-cyclic graphs of 4750 nodes gives a total of $11 \cdot 10^6$ paths,
where the longest path consists of $11 \cdot 10^6$ edges, while the running time
of Algorithm 2 is only 4.3s and the longest path is derived from the resulting
minimizing set in only 1.5s. Hence, in this case, the cost for answering query $q$
using the single-path query semantics is at most 5.8s.

On the other hand, we see that the performance of Algorithm 2 is heavily
influenced by the ambiguity of the context-free grammar. We see that query
$q_1$ evaluates orders of magnitude faster than query $q_2$, even though the
context-free languages underlying these queries are equivalent. The measurements
on $q_1$ and $\mathcal{L}$ show that evaluating $\mathcal{L}$ is faster, which is unsurprising as $\mathcal{L}$ is a
much simpler query. Still, the difference in running time for these two queries
is relatively small.

7 Conclusions and future work 

To address the limits of the traditional query semantics for navigational query
languages such as the context-free path queries, we proposed path-based query
semantics. We studied two such path-based query semantics, namely the all-path
query semantics and the single-path query semantics, and we provided a
formal framework for evaluating queries on graphs using both path-based query
semantics. Our initial results show that the path-based query semantics have
added practical value, and a small-scale experiment on an implementation of the
main query evaluation algorithms shows that query answering is feasible, even
when query results grow very large.

In conclusion, we believe that our work opens the door for further study of
path-based query semantics. Besides the open problem already stated in this
work (determining strict worst-case upper bounds on the size of the query result
under the single-path query semantics), several other directions for future work
are of interest to us.

1. Can we use more efficient algorithms for the problems outlined in this
paper? Can we, for example, apply techniques used for context-free
parsing [30] or for Datalog query evaluation?

2. All algorithms outlined in this paper are bottom-up. Can we derive
top-down algorithms or, in general, goal-oriented algorithms for answering
queries for a given pair of nodes?

3. Our measurements showed that two different context-free grammars for
the same context-free language can have huge differences in the running
time for query evaluation. Can we optimize context-free grammars to
guarantee better performance? Can we provide more efficient query
evaluation for deterministic or for unambiguous context-free grammars?

4. Are there approximation algorithms for evaluating queries using the
single-path query semantics that are guaranteed to produce paths whose length is
close to the length of paths of minimum length, while having a much lower
complexity? Our initial work on this topic shows that straightforward
naive methods exist to efficiently produce a deterministic non-recursive
set of production rules. Although such deterministic non-recursive sets of
production rules guarantee a worst-case upper bound on path lengths, the
length of the resulting paths is not necessarily close to optimal.

5. To what extent can we adapt path-based query evaluation such that it
exploits parallel hardware, distributed computing, and/or specialized
acceleration hardware?

6. Can we generalize path-based query semantics to query languages that
do not query based on path structures, but query based on patterns in
graphs (such as Datalog and the navigational expressions), and can we
provide efficient query evaluation for such graph-based query semantics?